Code to clean the data file-by-file

Importing the necessary libraries

In [1]:
import pandas as pd
import csv
import string
import re
import nltk

nltk.download('stopwords')
nltk.download('names')
from nltk.corpus import stopwords
from nltk.corpus import names
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud  # STOPWORDS and ImageColorGenerator are not used below; we build our own stop-word list

%matplotlib inline
pd.set_option('display.max_colwidth', 150)

(A) Read the CSV File

In [3]:
df = pd.read_csv("C:\\Users\\Aruna\\Documents\\input\\Amazon EC2 Full.csv")

df['description'] = df['description'].apply(lambda x: " ".join(x for x in str(x).split())) # coerce to string and collapse repeated whitespace
 
df.head(10)
Out[3]:
id label description
0 87485 Amazon EC2 Very strange slow speed via http and https Hello All, We have many EC2 instances (and basically all of them perform in similar fashion, i don't th...
1 87485 Amazon EC2 Hello mrnch! I am wondering if your EBS volume is actually your bottleneck. The m3.xlarge instance type is not EBS optimized by default but suppor...
2 87485 Amazon EC2 Hello, Thank you for the insight. The same happens with iperf (which basically does not need to load from file). Furthermore, when i access the fi...
3 87484 Amazon EC2 status check fail every day Hi, We created A Amazon linux 2 instance and installed Wordpress by taking help of aws wordpress installing tutorials....
4 87484 Amazon EC2 Hi, Can anyone please look into the issue and suggest/fix the issue. - Thanks, Raj.
5 87483 Amazon EC2 EBS Modify Volume Type - Magnetic (Previous Generation) to GP2 We have done few Volume type modifications from Magnetic to GP2 without any downtim...
6 87483 Amazon EC2 Hello, I am glad to see that you have been able to successfully migrate from magnetic -- previous generation volume type to GP2 -- the latest gene...
7 87482 Amazon EC2 Default DNS server didn't match VPC cidr Hello, We have the VPC with CIDR 10.170.112.0/21. When I create an EC2 instance in this VPC with default ...
8 87482 Amazon EC2 Hello Ihor, Based on the IP CIDR you provided your internal DNS would be 10.170.112.2. Please attempt a DNS query from that IP address and let me ...
9 87482 Amazon EC2 Hello, Michael Yes, 10.170.112.2 works. But why it wasn't configured automatically as default? -Ihor
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295113 entries, 0 to 295112
Data columns (total 3 columns):
id             295113 non-null int64
label          295113 non-null object
description    295113 non-null object
dtypes: int64(1), object(2)
memory usage: 6.8+ MB

Check out one sample post:

In [11]:
p = 5100

df['description'][p]
Out[11]:
'Hi, ssh should be the least of your worries. There are a few things you could do. 1) Use your own keypair with at least 4096 bits and passphrase protection. 2) Allow your ip, ssh access to Security Group only when you need. You will need to login right - to install, configure, patch your servers 3) Recent changes in Lambda allow longer execution time. However, you should look at using a different service like AWS CodeDeploy to handle deployments. Also, there are lot of integration points here, you will need to plan for them. 1) Bitbucket to Lambda - How do you plan to invoke Lambda? Api gateway to Lambda ? 2) Where do you package the code (tar, war, jar) ? 3) Are you launching new servers for every deployment ? You mention startup script pulls code. You could plan for an In place or Blue/Green deployment. 4) If a new server scales up, will it know where to get the packaged code from ? In my case, Jenkins, pulls the code locally, packages it and hands it over to AWS Code Deploy to push to all servers in the auto scale group. Edited by: Bali on Oct 13, 2018 2:51 PM'

Top 30 words and their frequencies:

In [12]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[12]:
the         829601
to          653820
I           420858
a           334993
and         329776
is          292704
you         230144
instance    228155
for         215820
in          209328
that        201472
of          200452
on          189195
it          181985
this        167461
have        160429
not         128952
with        127970
be          122548
your        117599
can         111467
an          109761
from        106510
are         102812
my           95995
as           89804
but          87378
at           74245
or           73401
if           68191
dtype: int64
In [13]:
print("There are", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words in total before cleaning.")
There are 22567450 words in total before cleaning.
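Joining all ~295k descriptions into a single string just to count words allocates a very large intermediate string. A `collections.Counter` accumulated row by row gives the same counts with far less memory; a sketch (the toy rows below are illustrative, not from the dataset):

```python
from collections import Counter

def top_words(descriptions, n=30):
    """Count word frequencies row by row instead of building one huge string."""
    counts = Counter()
    for text in descriptions:
        counts.update(str(text).split())
    return counts.most_common(n)

# Toy illustration:
rows = ["the instance is slow", "the instance restarted"]
print(top_words(rows, 3))  # [('the', 2), ('instance', 2), ('is', 1)]
```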

(B) Text Pre-processing

In [14]:
STOPWORDS = stopwords.words('english')
my_stop_words = ["hi", "hello", "regards", "thank", "thanks", "regard", "best", "wishes", "hey", "amazon", "aws", "s3",
"elastic", "beanstalk", "rds", "ec2", "lambda", "cloudfront", "cloud", "front", "vpc", "sns", "me",
"january", "february", "march", "april", "may", "june", "july", "august", "september", "october", 
"november", "december", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept", "oct", "nov",
"dec", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", "mon", "tue",
"wed", "thu", "fri", "sat", "sun", "ain't", "aren't", "can't", "can't've", "'cause", "could've", "couldn't",
"couldn't've", "didn't", "doesn't", "don't", "hadn't", "hadn't've", "hasn't", "haven't", "he'd", "he'd've",
"he'll", "he'll've", "he's", "how'd", "how'd'y", "how'll", "how's", "i'd", "i'd've", "i'll", "i'll've", "i'm",
"i've", "isn't", "it'd", "it'd've", "it'll", "it'll've", "it's", "let's", "mayn't", "might've", "mightn't",
"mightn't've", "must've", "mustn't", "mustn't've", "needn't", "needn't've", "oughtn't", "oughtn't've", "shan't",
"sha'n't", "shan't've", "she'd", "she'd've", "she'll", "she'll've", "she's", "should've", "shouldn't", "shouldn't've",
"so've", "so's", "that'd", "that'd've", "that's", "there'd", "there'd've", "there's", "they'd", "they'd've", "they'll",
"they'll've", "they're", "they've", "to've", "wasn't", "we'd", "we'd've", "we'll", "we'll've", "we're", "we've",
"weren't", "what'll", "what'll've", "what're", "what's", "what've", "when's", "when've", "where'd", "where's",
"where've", "who'll", "who'll've", "who's", "who've", "why's", "why've", "will've", "won't", "won't've", "would've",
"wouldn't", "wouldn't've", "yall", "yalld", "yalldve", "yallre", "yallve", "youd", "youdve", "youll",
"youllve", "youre", "youve", "do", "did", "does", "had", "have", "has", "could", "can", "as", "is",
"shall", "should", "would", "will", "you", "me", "please", "know", "who", "we", "was", "were", "edited", "by", "pm"]

name = names.words()
STOPWORDS.extend(my_stop_words)
STOPWORDS.extend(name)

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,:;#+?]')  # punctuation to replace with a space
BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')            # keep digits, lowercase letters, space, '_' and '.'
REMOVE_HTML_RE = re.compile(r'<.*?>')                     # HTML tags
REMOVE_HTTP_RE = re.compile(r'http\S+')                   # URLs

# Normalize the stop words with the same pattern so that, e.g., "don't" matches the cleaned token "dont"
STOPWORDS = [BAD_SYMBOLS_RE.sub('', x) for x in STOPWORDS]
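The last line above runs the stop words themselves through `BAD_SYMBOLS_RE`, so that contractions in the list lose their apostrophes and match tokens after cleaning. A quick check with the same pattern:

```python
import re

BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')

print(BAD_SYMBOLS_RE.sub('', "don't"))   # dont
print(BAD_SYMBOLS_RE.sub('', "she'll"))  # shell
```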

Convert to lowercase

In [15]:
df['description'] = df['description'].str.lower()  # vectorized; equivalent to lowercasing each token

df['description'][p]
Out[15]:
'hi, ssh should be the least of your worries. there are a few things you could do. 1) use your own keypair with at least 4096 bits and passphrase protection. 2) allow your ip, ssh access to security group only when you need. you will need to login right - to install, configure, patch your servers 3) recent changes in lambda allow longer execution time. however, you should look at using a different service like aws codedeploy to handle deployments. also, there are lot of integration points here, you will need to plan for them. 1) bitbucket to lambda - how do you plan to invoke lambda? api gateway to lambda ? 2) where do you package the code (tar, war, jar) ? 3) are you launching new servers for every deployment ? you mention startup script pulls code. you could plan for an in place or blue/green deployment. 4) if a new server scales up, will it know where to get the packaged code from ? in my case, jenkins, pulls the code locally, packages it and hands it over to aws code deploy to push to all servers in the auto scale group. edited by: bali on oct 13, 2018 2:51 pm'

Remove all HTML tags

In [16]:
df['description'] = df['description'].apply(lambda x: REMOVE_HTML_RE.sub(' ', str(x)))  # sub on the whole string, so tags with attributes (which contain spaces) are also matched

df['description'][p]
Out[16]:
'hi, ssh should be the least of your worries. there are a few things you could do. 1) use your own keypair with at least 4096 bits and passphrase protection. 2) allow your ip, ssh access to security group only when you need. you will need to login right - to install, configure, patch your servers 3) recent changes in lambda allow longer execution time. however, you should look at using a different service like aws codedeploy to handle deployments. also, there are lot of integration points here, you will need to plan for them. 1) bitbucket to lambda - how do you plan to invoke lambda? api gateway to lambda ? 2) where do you package the code (tar, war, jar) ? 3) are you launching new servers for every deployment ? you mention startup script pulls code. you could plan for an in place or blue/green deployment. 4) if a new server scales up, will it know where to get the packaged code from ? in my case, jenkins, pulls the code locally, packages it and hands it over to aws code deploy to push to all servers in the auto scale group. edited by: bali on oct 13, 2018 2:51 pm'
Remove URLs

In [17]:
df['description'] = df['description'].apply(lambda x: REMOVE_HTTP_RE.sub(' ', str(x)))

df['description'][p]
Out[17]:
'hi, ssh should be the least of your worries. there are a few things you could do. 1) use your own keypair with at least 4096 bits and passphrase protection. 2) allow your ip, ssh access to security group only when you need. you will need to login right - to install, configure, patch your servers 3) recent changes in lambda allow longer execution time. however, you should look at using a different service like aws codedeploy to handle deployments. also, there are lot of integration points here, you will need to plan for them. 1) bitbucket to lambda - how do you plan to invoke lambda? api gateway to lambda ? 2) where do you package the code (tar, war, jar) ? 3) are you launching new servers for every deployment ? you mention startup script pulls code. you could plan for an in place or blue/green deployment. 4) if a new server scales up, will it know where to get the packaged code from ? in my case, jenkins, pulls the code locally, packages it and hands it over to aws code deploy to push to all servers in the auto scale group. edited by: bali on oct 13, 2018 2:51 pm'

Replace certain punctuation characters (slashes, parentheses, brackets, etc.) with spaces

In [18]:
df['description'] = df['description'].apply(lambda x: REPLACE_BY_SPACE_RE.sub(' ', str(x)))

df['description'][p]
Out[18]:
'hi  ssh should be the least of your worries. there are a few things you could do. 1  use your own keypair with at least 4096 bits and passphrase protection. 2  allow your ip  ssh access to security group only when you need. you will need to login right - to install  configure  patch your servers 3  recent changes in lambda allow longer execution time. however  you should look at using a different service like aws codedeploy to handle deployments. also  there are lot of integration points here  you will need to plan for them. 1  bitbucket to lambda - how do you plan to invoke lambda  api gateway to lambda   2  where do you package the code  tar  war  jar    3  are you launching new servers for every deployment   you mention startup script pulls code. you could plan for an in place or blue green deployment. 4  if a new server scales up  will it know where to get the packaged code from   in my case  jenkins  pulls the code locally  packages it and hands it over to aws code deploy to push to all servers in the auto scale group. edited by  bali on oct 13  2018 2 51 pm'

Remove any remaining unwanted symbols (such as $, %, and quotation marks)

In [19]:
df['description'] = df['description'].apply(lambda x: " ".join(BAD_SYMBOLS_RE.sub('', x) for x in str(x).split()))

df['description'][p]
Out[19]:
'hi ssh should be the least of your worries. there are a few things you could do. 1 use your own keypair with at least 4096 bits and passphrase protection. 2 allow your ip ssh access to security group only when you need. you will need to login right  to install configure patch your servers 3 recent changes in lambda allow longer execution time. however you should look at using a different service like aws codedeploy to handle deployments. also there are lot of integration points here you will need to plan for them. 1 bitbucket to lambda  how do you plan to invoke lambda api gateway to lambda 2 where do you package the code tar war jar 3 are you launching new servers for every deployment you mention startup script pulls code. you could plan for an in place or blue green deployment. 4 if a new server scales up will it know where to get the packaged code from in my case jenkins pulls the code locally packages it and hands it over to aws code deploy to push to all servers in the auto scale group. edited by bali on oct 13 2018 2 51 pm'
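One subtlety of the `[^0-9a-z - _.]+` pattern: inside a character class, the ` - ` sequence parses as a (degenerate) space-to-space range, so the literal hyphen is not in the keep-set and hyphens are removed at this step — which is why the standalone hyphen in "login right - to install" disappears above. A demonstration:

```python
import re

BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')

print(BAD_SYMBOLS_RE.sub('', 'blue-green'))    # bluegreen  (hyphen removed)
print(BAD_SYMBOLS_RE.sub('', 'log_file.txt'))  # log_file.txt  (underscore and dot kept)
```

To keep hyphens literally, the hyphen would have to be escaped or placed last in the class, e.g. `[^0-9a-z _.-]+`.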

Strip leading and trailing periods, hyphens, and underscores from each token

In [20]:
df['description'] = df['description'].apply(lambda x: " ".join(w.strip('.-_') for w in x.split()))  # one pass instead of three
df['description'][p]
Out[20]:
'hi ssh should be the least of your worries there are a few things you could do 1 use your own keypair with at least 4096 bits and passphrase protection 2 allow your ip ssh access to security group only when you need you will need to login right to install configure patch your servers 3 recent changes in lambda allow longer execution time however you should look at using a different service like aws codedeploy to handle deployments also there are lot of integration points here you will need to plan for them 1 bitbucket to lambda how do you plan to invoke lambda api gateway to lambda 2 where do you package the code tar war jar 3 are you launching new servers for every deployment you mention startup script pulls code you could plan for an in place or blue green deployment 4 if a new server scales up will it know where to get the packaged code from in my case jenkins pulls the code locally packages it and hands it over to aws code deploy to push to all servers in the auto scale group edited by bali on oct 13 2018 2 51 pm'
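`str.strip` with a character set removes any run of those characters from both ends while leaving interior occurrences alone, so tokens like "m3.xlarge" keep their internal dot. A quick sanity check:

```python
tokens = ['servers.', '--flag', '_name_', 'm3.xlarge', '...']
print([t.strip('.-_') for t in tokens])  # ['servers', 'flag', 'name', 'm3.xlarge', '']
```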

Remove purely numeric tokens

In [21]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if not x.isdigit()))

df['description'][p]
Out[21]:
'hi ssh should be the least of your worries there are a few things you could do use your own keypair with at least bits and passphrase protection allow your ip ssh access to security group only when you need you will need to login right to install configure patch your servers recent changes in lambda allow longer execution time however you should look at using a different service like aws codedeploy to handle deployments also there are lot of integration points here you will need to plan for them bitbucket to lambda how do you plan to invoke lambda api gateway to lambda where do you package the code tar war jar are you launching new servers for every deployment you mention startup script pulls code you could plan for an in place or blue green deployment if a new server scales up will it know where to get the packaged code from in my case jenkins pulls the code locally packages it and hands it over to aws code deploy to push to all servers in the auto scale group edited by bali on oct pm'
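`str.isdigit` is true only for tokens made entirely of digits, so "4096" and "2018" are dropped while mixed tokens like "m3.xlarge" or dotted IPs survive:

```python
tokens = ['4096', 'm3.xlarge', '2018', 'ipv4', '10.170.112.2']
print([t for t in tokens if not t.isdigit()])  # ['m3.xlarge', 'ipv4', '10.170.112.2']
```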

Remove stop words and single-character tokens

In [22]:
stop_set = set(STOPWORDS)  # O(1) membership tests; STOPWORDS is a long list
df['description'] = df['description'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_set
                                                               and len(w) > 1))

df['description'][p]
Out[22]:
'ssh least worries things use keypair least bits passphrase protection allow ssh access security group need need login install configure patch servers recent changes allow longer execution time however look using different service like codedeploy handle deployments also lot integration points need plan bitbucket plan invoke api gateway package code war jar launching new servers every deployment mention startup script pulls code plan place blue green deployment new server scales get packaged code case jenkins pulls code locally packages hands code deploy push servers auto scale group bali'
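The stop-word list (the English stop words plus the entire NLTK names corpus) runs to several thousand entries, and looking a token up in a Python list scans it linearly; a set gives O(1) lookups, which matters when filtering ~22 million tokens. A sketch of the difference, with a stand-in list (the size 8000 is an assumption, not the exact list length):

```python
import timeit

stop_list = [f'word{i}' for i in range(8000)]  # stand-in for the real stop-word list
stop_set = set(stop_list)

missing = 'instance'  # not a stop word: worst case for the list scan
t_list = timeit.timeit(lambda: missing in stop_list, number=1000)
t_set = timeit.timeit(lambda: missing in stop_set, number=1000)
print(f'list: {t_list:.4f}s  set: {t_set:.4f}s')  # the set lookup is dramatically faster
```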

Results after cleaning the data:

In [23]:
df.head()
Out[23]:
id label description
0 87485 Amazon EC2 strange slow speed via http many instances basically perform similar fashion think got decent speed one machine m3.xlarge eucentral1a hosts web si...
1 87485 Amazon EC2 mrnch wondering ebs volume actually bottleneck m3.xlarge instance type ebs optimized default supports ebs optimization enable optimization instanc...
2 87485 Amazon EC2 insight happens iperf basically need load file furthermore access file another instance within region lightning fast suspecting drivers network ca...
3 87484 Amazon EC2 status check fail every day created linux instance installed wordpress taking help wordpress installing tutorials site running good faced status c...
4 87484 Amazon EC2 anyone look issue suggest fix issue raj

Top 30 words and their frequencies:

In [24]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[24]:
instance     331818
instances     79376
server        65505
using         58122
new           57552
volume        54863
running       54554
ms            50073
get           49083
one           47287
issue         47170
error         46511
use           44291
see           43717
ebs           43042
problem       41605
help          39684
need          39421
ssh           39400
time          39398
like          37974
also          35529
data          34428
still         33275
windows       31923
access        31733
root          31147
kernel        30824
file          30504
connect       30461
dtype: int64
In [25]:
print("There are", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words in total after cleaning.")
There are 11539439 words in total after cleaning.

(C) Write to CleanText.csv

In [26]:
# Note: mode 'a' appends on every run, so re-running this cell duplicates rows; use 'w' to overwrite instead.
with open('C:\\Users\\Aruna\\Documents\\ACMS-IID\\input\\CleanText.csv', 'a', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # writer.writerow(['id', 'label', 'description'])
    for i in range(len(df['description'])):
        if len(df['description'][i]) > 1:  # skip rows whose description is empty after cleaning
            writer.writerow([df['id'][i], df['label'][i], df['description'][i]])
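The `len(...) > 1` guard drops rows whose description was emptied by cleaning. The same filter-and-write loop can be sanity-checked against an in-memory buffer (the rows below are made up for illustration):

```python
import csv
import io

rows = [
    (1, 'Amazon EC2', 'instance slow'),
    (2, 'Amazon EC2', ''),             # emptied by cleaning; should be skipped
    (3, 'Amazon EC2', 'volume error'),
]

buf = io.StringIO()
writer = csv.writer(buf)
for rid, label, desc in rows:
    if len(desc) > 1:
        writer.writerow([rid, label, desc])

print(buf.getvalue().splitlines())  # ['1,Amazon EC2,instance slow', '3,Amazon EC2,volume error']
```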

(D) Generate the word cloud

In [27]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize=(100, 100))
wordcloud = WordCloud(max_font_size=20, max_words=20, background_color="white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[27]:
(-0.5, 399.5, 199.5, -0.5)
In [28]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize=(100, 100))
wordcloud = WordCloud(max_font_size=20, max_words=50, background_color="white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[28]:
(-0.5, 399.5, 199.5, -0.5)
In [29]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize=(100, 100))
wordcloud = WordCloud(max_font_size=20, max_words=100, background_color="white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[29]:
(-0.5, 399.5, 199.5, -0.5)
In [30]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize=(100, 100))
wordcloud = WordCloud(max_font_size=20, max_words=500, background_color="white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[30]:
(-0.5, 399.5, 199.5, -0.5)
In [31]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize=(100, 100))
wordcloud = WordCloud(max_font_size=20, max_words=1000, background_color="white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[31]:
(-0.5, 399.5, 199.5, -0.5)